Skip to content

Stabilize diskless no-drop replication test#3511

Merged
zuiderkwast merged 7 commits intovalkey-io:unstablefrom
sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414
Apr 21, 2026
Merged

Stabilize diskless no-drop replication test#3511
zuiderkwast merged 7 commits intovalkey-io:unstablefrom
sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414

Conversation

@sarthakaggarwal97
Copy link
Copy Markdown
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented Apr 15, 2026

This deflakes all variants of diskless replicas drop during rdb pipe.

The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.

CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (6444717) to head (fe1d7e6).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3511      +/-   ##
============================================
- Coverage     76.58%   76.38%   -0.20%     
============================================
  Files           159      159              
  Lines         80019    80019              
============================================
- Hits          61283    61125     -158     
- Misses        18736    18894     +158     

see 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Nikhil-Manglore
Copy link
Copy Markdown
Member

Nikhil-Manglore commented Apr 15, 2026

Did this test fail recently, I know viktor attempted to fix it in this PR: #3461

Edit: I just saw the comments on the PR, looks like it did fail again recently.

Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Only the "no replicas drop" case is covered, so the other cases can still have timing issues, such as "fast" and "slow" cases? I see there are some special cases for the other cases, for example for "timeout" there is pause_process as well. Do you have a full picture?

Comment thread tests/integration/replication.tcl Outdated
Comment thread tests/integration/replication.tcl Outdated
Comment thread tests/integration/replication.tcl Outdated
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 14afd0b to 5a24b23 Compare April 16, 2026 16:45
@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

sarthakaggarwal97 commented Apr 16, 2026

@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934

@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

sarthakaggarwal97 commented Apr 17, 2026

@zuiderkwast I think it still deflakes the tests a lot, but I am afraid out of 500, I still see 1-5 flaky runs.

@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 56ae8dc to 1e467b0 Compare April 17, 2026 04:04
@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

This version is quite stable. Not seeing failures anymore.

@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 17, 2026
Copy link
Copy Markdown
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! Sorry for the delay.

Comment thread tests/integration/replication.tcl
sarthakaggarwal97 and others added 7 commits April 21, 2026 09:54
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
The 'slow' subcase fallback was searching for a log message matching
'*Connection with replica client id * lost.*' but the actual server log
format is 'Connection with replica <host>:<port> lost.' — there is no
'client id' in the message. Fix the glob to '*Connection with replica * lost.*'.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
On fast ARM CI runners the RDB transfer completes before both replicas
are killed, so the primary logs '2 replicas still up' instead of
'last replica dropped' or '1 replicas still up'. Add a nested catch
fallback to accept all three possible outcomes.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 9a86d16 to fe1d7e6 Compare April 21, 2026 16:58
Copy link
Copy Markdown
Member

@Nikhil-Manglore Nikhil-Manglore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sarthakaggarwal97
Copy link
Copy Markdown
Contributor Author

sarthakaggarwal97 commented Apr 21, 2026

The test failures are unrelated to this change. Related to #2936

@zuiderkwast zuiderkwast merged commit 03c2d4c into valkey-io:unstable Apr 21, 2026
63 of 66 checks passed
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 23, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants